Get Ready
library(plyr)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:plyr’:
arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggpubr)
Loading required package: ggplot2
Loading required package: magrittr
Attaching package: ‘ggpubr’
The following object is masked from ‘package:plyr’:
mutate
library(tidyr)
Attaching package: ‘tidyr’
The following object is masked from ‘package:magrittr’:
extract
library(ggplot2)
library(lubridate)
package ‘lubridate’ was built under R version 3.6.2
Attaching package: ‘lubridate’
The following objects are masked from ‘package:dplyr’:
intersect, setdiff, union
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
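The masking messages above mean that several function names (mutate, summarise, filter, …) now resolve to whichever package was attached last. When a call must be unambiguous, the `::` prefix pins it to a package regardless of load order. A minimal illustration on a toy data frame (not the bikeshare data):

```r
library(dplyr)

# Toy data frame; dplyr::mutate() is called explicitly so plyr's or
# ggpubr's mutate() cannot shadow it.
df <- data.frame(n = 1:3)
res <- dplyr::mutate(df, doubled = n * 2)
res$doubled  # 2 4 6
```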
Load data sets
# Load data sets
ny <- read.csv("new_york_city.csv")
wash <- read.csv("washington.csv")
chi <- read.csv("chicago.csv")
Inspect the New York City data set
# Inspect the New York City data set
head(ny)
dim(ny)
[1] 54770 9
colnames(ny)
[1] "X" "Start.Time" "End.Time" "Trip.Duration" "Start.Station" "End.Station" "User.Type"
[8] "Gender" "Birth.Year"
str(ny)
'data.frame': 54770 obs. of 9 variables:
$ X : int 5688089 4096714 2173887 3945638 6208972 1285652 1675753 1692245 2271331 1558339 ...
$ Start.Time : Factor w/ 54568 levels "2017-01-01 00:17:01",..: 45448 32799 17316 31589 49688 10220 13390 13509 18111 12449 ...
$ End.Time : Factor w/ 54562 levels "201","2017-01-01 00:30:56",..: 45432 32783 17295 31567 49668 10204 13364 13505 18092 12422 ...
$ Trip.Duration: int 795 692 1325 703 329 998 478 4038 5132 309 ...
$ Start.Station: Factor w/ 636 levels "","1 Ave & E 16 St",..: 522 406 10 93 5 521 325 309 151 245 ...
$ End.Station : Factor w/ 638 levels "","1 Ave & E 16 St",..: 613 8 362 558 269 107 389 110 151 243 ...
$ User.Type : Factor w/ 3 levels "","Customer",..: 3 3 3 3 3 3 3 3 2 3 ...
$ Gender : Factor w/ 3 levels "","Female","Male": 3 3 3 2 3 3 3 3 1 3 ...
$ Birth.Year : num 1998 1981 1987 1986 1992 ...
summary(ny)
X Start.Time End.Time Trip.Duration Start.Station
Min. : 47 2017-05-11 18:26:10: 3 2017-01-03 08:54:10: 2 Min. : 61.0 Pershing Square North: 592
1st Qu.:1712425 2017-01-04 13:58:24: 2 2017-01-04 17:21:55: 2 1st Qu.: 368.0 W 21 St & 6 Ave : 385
Median :3418634 2017-01-09 09:36:01: 2 2017-01-05 17:25:17: 2 Median : 610.0 Broadway & E 22 St : 383
Mean :3415873 2017-01-21 15:36:56: 2 2017-01-12 08:34:01: 2 Mean : 903.6 E 17 St & Broadway : 380
3rd Qu.:5123382 2017-01-21 17:49:59: 2 2017-01-12 09:41:54: 2 3rd Qu.: 1051.0 West St & Chambers St: 364
Max. :6816152 2017-01-21 20:08:29: 2 2017-01-12 20:34:42: 2 Max. :1088634.0 W 20 St & 11 Ave : 329
(Other) :54757 (Other) :54758 NA's :1 (Other) :52337
End.Station User.Type Gender Birth.Year
Pershing Square North: 556 : 119 : 5410 Min. :1885
E 17 St & Broadway : 445 Customer : 5558 Female:12159 1st Qu.:1970
Broadway & E 22 St : 427 Subscriber:49093 Male :37201 Median :1981
W 21 St & 6 Ave : 365 Mean :1978
W 20 St & 11 Ave : 344 3rd Qu.:1988
W 38 St & 8 Ave : 338 Max. :2001
(Other) :52295 NA's :5218
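Note that str(ny) shows Start.Time and End.Time stored as factors (the pre-R 4.0 default of read.csv). Any time-based analysis would first need them converted to POSIXct. A minimal sketch with lubridate, using toy timestamps shaped like the factor levels above; the stray level "201" in End.Time would parse to NA:

```r
library(lubridate)

# Toy stand-in for ny$Start.Time: a factor of "YYYY-MM-DD HH:MM:SS" strings
start <- factor(c("2017-01-01 00:17:01", "2017-05-11 18:26:10"))

# factor -> character -> POSIXct
start_dt <- ymd_hms(as.character(start))

# Malformed levels such as "201" come back as NA (with a warning)
bad <- suppressWarnings(ymd_hms("201"))
is.na(bad)  # TRUE
```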
Inspect the Washington D.C. data set
# Inspect the Washington D.C. data set
head(wash)
dim(wash)
[1] 89051 7
colnames(wash)
[1] "X" "Start.Time" "End.Time" "Trip.Duration" "Start.Station" "End.Station" "User.Type"
str(wash)
'data.frame': 89051 obs. of 7 variables:
$ X : int 1621326 482740 1330037 665458 1481135 1148202 1594275 1601832 574182 327058 ...
$ Start.Time : Factor w/ 81223 levels "","2017-01-01 00:11:00",..: 74753 19510 59964 26708 67716 50891 73381 73775 23142 13333 ...
$ End.Time : Factor w/ 81217 levels "","2017-01-01 00:14:00",..: 74744 19473 59981 26732 67753 50918 73397 73775 23114 13350 ...
$ Trip.Duration: num 489 403 637 1827 1549 ...
$ Start.Station: Factor w/ 478 levels "","10th & E St NW",..: 27 478 66 221 278 84 368 82 71 60 ...
$ End.Station : Factor w/ 479 levels "","10th & E St NW",..: 47 219 144 312 315 239 162 376 51 308 ...
$ User.Type : Factor w/ 3 levels "","Customer",..: 3 3 3 2 3 3 3 3 3 3 ...
summary(wash)
X Start.Time End.Time Trip.Duration
Min. : 7 2017-02-19 12:19:00: 6 2017-03-09 17:54:00: 7 Min. : 60.3
1st Qu.: 434587 2017-02-20 11:35:00: 6 2017-03-28 18:11:00: 7 1st Qu.: 410.9
Median : 872858 2017-02-24 17:46:00: 6 2017-01-13 17:48:00: 6 Median : 707.0
Mean : 873881 2017-03-01 08:20:00: 6 2017-01-31 08:49:00: 6 Mean : 1234.0
3rd Qu.:1313305 2017-03-02 08:39:00: 6 2017-02-13 18:09:00: 6 3rd Qu.: 1233.2
Max. :1751392 2017-03-09 17:31:00: 6 2017-02-20 11:38:00: 6 Max. :904591.4
(Other) :89015 (Other) :89013 NA's :1
Start.Station End.Station User.Type
Columbus Circle / Union Station : 1700 Columbus Circle / Union Station : 1767 : 1
Lincoln Memorial : 1546 Jefferson Dr & 14th St SW : 1603 Customer :23450
Jefferson Dr & 14th St SW : 1488 Lincoln Memorial : 1514 Subscriber:65600
Massachusetts Ave & Dupont Circle NW: 1219 Massachusetts Ave & Dupont Circle NW : 1344
Jefferson Memorial : 1068 Smithsonian-National Mall / Jefferson Dr & 12th St SW: 1103
15th & P St NW : 1040 15th & P St NW : 1077
(Other) :80990 (Other) :80643
Inspect the Chicago data set
# Inspect the Chicago data set
head(chi)
dim(chi)
[1] 8630 9
colnames(chi)
[1] "X" "Start.Time" "End.Time" "Trip.Duration" "Start.Station" "End.Station" "User.Type"
[8] "Gender" "Birth.Year"
str(chi)
'data.frame': 8630 obs. of 9 variables:
$ X : int 1423854 955915 9031 304487 45207 1473887 961916 65924 606841 135470 ...
$ Start.Time : Factor w/ 8624 levels "2017-01-01 00:40:14",..: 7876 5303 73 1721 267 8173 5347 368 3376 795 ...
$ End.Time : Factor w/ 8625 levels "2017-01-01 00:46:32",..: 7876 5303 73 1722 267 8173 5346 368 3376 796 ...
$ Trip.Duration: int 321 1610 416 350 534 586 281 723 689 493 ...
$ Start.Station: Factor w/ 472 levels "2112 W Peterson Ave",..: 468 424 291 80 103 119 22 255 374 420 ...
$ End.Station : Factor w/ 471 levels "","2112 W Peterson Ave",..: 132 381 469 409 151 70 467 251 200 118 ...
$ User.Type : Factor w/ 3 levels "","Customer",..: 3 3 3 3 3 3 3 2 3 3 ...
$ Gender : Factor w/ 3 levels "","Female","Male": 3 2 3 3 3 3 2 1 3 3 ...
$ Birth.Year : num 1992 1992 1981 1986 1975 ...
summary(chi)
X Start.Time End.Time Trip.Duration Start.Station
Min. : 36 2017-01-24 07:40:32: 2 2017-04-16 13:16:52: 2 Min. : 60.0 Streeter Dr & Grand Ave : 210
1st Qu.: 386722 2017-04-22 13:16:25: 2 2017-04-26 16:29:26: 2 1st Qu.: 394.2 Lake Shore Dr & Monroe St : 140
Median : 773554 2017-05-27 15:17:50: 2 2017-05-21 16:20:56: 2 Median : 670.0 Clinton St & Washington Blvd: 120
Mean : 776721 2017-06-10 13:29:41: 2 2017-05-27 09:58:21: 2 Mean : 937.2 Clinton St & Madison St : 102
3rd Qu.:1171266 2017-06-20 17:05:11: 2 2017-06-25 14:51:35: 2 3rd Qu.: 1119.0 Canal St & Adams St : 101
Max. :1551248 2017-06-21 13:18:52: 2 2017-01-01 00:46:32: 1 Max. :85408.0 Michigan Ave & Oak St : 98
(Other) :8618 (Other) :8619 (Other) :7859
End.Station User.Type Gender Birth.Year
Streeter Dr & Grand Ave : 233 : 1 :1748 Min. :1899
Clinton St & Madison St : 145 Customer :1746 Female:1723 1st Qu.:1975
Theater on the Lake : 131 Subscriber:6883 Male :5159 Median :1984
Lake Shore Dr & Monroe St : 115 Mean :1981
Clinton St & Washington Blvd: 109 3rd Qu.:1989
Lake Shore Dr & North Blvd : 102 Max. :2002
(Other) :7795 NA's :1747
Preparations (Joining the Data Sets)
Before joining the data sets, I add the variable city to each original data set so that, after the join, every observation can still be traced back to its city. I also drop all observations with missing values.
Build function to add the variable city and drop all observations with missing values
# Build function to add the variable city and drop all observations with missing values
city_omit <- function(x) {
  # Derive the city label from the name of the data set passed in
  city_name <- deparse(substitute(x))
  x %>%
    mutate(city = city_name) %>%
    na.omit()
}
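One caveat with deparse(substitute(x)): it captures the expression passed as the argument, so it only yields a clean label when called with a bare variable name, as below. Inside a loop or a pipe it would capture the whole expression instead. A hypothetical variant (city_omit2, not part of the original analysis) sidesteps this by taking the label explicitly:

```r
library(dplyr)

# Variant that takes the city label as an explicit argument
city_omit2 <- function(x, city_name) {
  x %>%
    mutate(city = city_name) %>%
    na.omit()
}

# Toy data frame standing in for one of the city data sets
toy <- data.frame(Trip.Duration = c(795, NA, 329))
city_omit2(toy, "ny")  # row with NA dropped; city column added
```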
Use the function to add the variable city and drop all observations with missing values for each city
# Use the function to add the variable city and drop all observations with missing values for each city
ny <- city_omit(ny)
wash <- city_omit(wash)
chi <- city_omit(chi)
Check results of the function
# Check results of the function
head(ny)
dim(ny)
[1] 49552 10
head(wash)
dim(wash)
[1] 89050 8
head(chi)
dim(chi)
[1] 6883 10
Use plyr’s rbind.fill() function to join all three data sets (Union)
# Use plyr's rbind.fill() function to join all three data sets (Union)
bikeshare <- rbind.fill(ny, wash, chi)
head(bikeshare)
tail(bikeshare)
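rbind.fill() pads columns that are absent from one of its inputs with NA, which is why the Washington rows (no Gender or Birth.Year columns) can be stacked with the other two cities. A toy sketch of that behaviour, not run on the bikeshare data itself:

```r
library(plyr)

# Toy frames mirroring the column mismatch: b lacks the Gender column
a <- data.frame(Trip.Duration = c(795, 692), Gender = c("Male", "Male"))
b <- data.frame(Trip.Duration = c(489, 403))

joined <- rbind.fill(a, b)
joined$Gender  # the rows from b get NA for the missing column
```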
Build function to check if the sizes of joined data sets are equal
# Build function to check if the sizes of the joined data sets are equal
check <- function(x, y) {
  # if/else prints the message exactly once; the original ifelse(x == y,
  # print(...), print(...)) also auto-printed the returned value, which
  # duplicated the output
  if (x == y) {
    print("The size of the data sets is equal")
  } else {
    print("Error: sizes differ")
  }
  invisible(NULL)
}
Quickly check if the size of the joined data set is equal to the sum of the sizes of the individual data sets
# Quickly check if the size of the joined data set is equal to the sum of the sizes of the individual data sets
x1 <- nrow(bikeshare)
y1 <- nrow(ny) + nrow(wash) + nrow(chi)
check(x1, y1)
[1] "The size of the data sets is equal"
Add id column to uniquely identify each observation
# Add id column to uniquely identify each observation
bikeshare$id <- seq.int(nrow(bikeshare))
head(bikeshare)
tail(bikeshare)